Our group was initially interested in exploring Public Health-related datasets due to our desire of exploring geospatial details of disease and how that affects different geographies. COVID-19 was a huge influence to us wanting to experiment with another form of health concern. This specific dataset includes cases of the Dengue Fever in the city of Merida, from 2012 to 2015, and outlines the coordinates of where every case specifically took place. Through our analysis, we plan on visualizing how the Dengue Fever has spread across Merida over a four-year period, and how it can help inform how diseases typically spread in urban environments.
Before developing any visualizations or analytical models, we must first prepare the R environment and load our primary dataset, “Merida_Den12_13_14_15.csv”. To accomplish this, and to equip our notebook for the complex analysis that follows, we first load a complete suite of R packages. We load the tidyverse package, which provides core tools like dplyr for data transformation and ggplot2 for static visualization. For our more advanced interactive plots, we load plotly (used for the animated density map) and leaflet (used for the interactive Year-over-Year map). We also include sf to read and handle the simple features (geospatial) data for Merida. Once all libraries are loaded, we use the read.csv() function to import the dataset into a data frame and verify its contents and structure.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(leaflet)
library(sf)
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(htmltools)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
data <- read.csv('Merida_Den12_13_14_15.csv')
head(data)
## X Y SP_ID CVEGEO Den2012 Den2013 Den2014 Den2015
## 1 231639.6 2317510 0 3104100010070 17 6 2 2
## 2 231323.1 2316622 1 3104100010085 35 15 10 18
## 3 230769.7 2316113 2 310410001009A 35 14 8 17
## 4 233250.8 2316208 3 3104100010136 40 8 6 10
## 5 233266.3 2317442 4 3104100010140 12 8 4 4
## 6 232736.6 2317238 5 3104100010155 30 8 1 5
data_cleaned <- data %>%
select(X, Y, starts_with("Den"))
na_counts <- colSums(is.na(data_cleaned))
print(na_counts)
## X Y Den2012 Den2013 Den2014 Den2015
## 0 0 0 0 0 0
head(data_cleaned)
## X Y Den2012 Den2013 Den2014 Den2015
## 1 231639.6 2317510 17 6 2 2
## 2 231323.1 2316622 35 15 10 18
## 3 230769.7 2316113 35 14 8 17
## 4 233250.8 2316208 40 8 6 10
## 5 233266.3 2317442 12 8 4 4
## 6 232736.6 2317238 30 8 1 5
Data Cleaning is a crucial step to ensure the integrity and relevance of the data prior to our analysis. We want to ensure that all the data required for this project is clean, validated, and standardized so that our group can create data visualizations using the shared data. The raw dataset contains several columns, including SP_ID and GEOID, which serve as identifiers for position and case record. After discussing the goals for the project, we determined that our exploratory analysis is focused exclusively on the relationship between spatial coordinates and case counts. Therefore, the important features for this analysis are the X and Y coordinates and the annual case count columns (Den2012 through Den2015). The SP_ID and GEOID columns, while potentially useful for future work (e.g., joining with demographic or census data), are not required for our current visualization and modeling goals. Therefore, we dropped the project’s irrelevant columns, creating a standard data frame named data_cleaned. The second important step is to check for missing data. Missing data can introduce significant bias, propagate errors in calculations, and cause visualization failures. The R console output confirms that all columns returned a count of “0”. This finding validates the completeness of our dataset and we can therefore proceed with our analysis without the need for data imputation or row-wise deletion, ensuring that our results are based on the full set of observations.
# This plot shows the number of Dengue Fever cases in the city of Merida from 2012 to 2015, while also highlighting the Year-over-Year percent change of Dengue Fever cases from the previous year. Dengue Fever clearly has the highest amount of cases in 2012, however, sees a ~50% drop in cases in the following 2013 year. The number of cases still drops going into 2014, however, the number of cases spikes up by almost 200% from 2014 to 2015. This drastic increase in Dengue Fever cases in 2015 can lead us to believe that some sort of environmental or external effect led to cases going up to nearly 75,000. We could also hypothesize that outbreaks typically last around three years, before another outbreak starts.
data_formatted <- data %>%
pivot_longer(
cols = starts_with("Den"),
names_to = "Year",
values_to = "Count"
) %>%
mutate(Year = as.numeric(gsub("Den", "", Year)))
data_summary <- data_formatted %>%
group_by(Year) %>%
summarise(
total_Dengue = sum(Count, na.rm = TRUE))%>%
arrange(Year) %>%
mutate(
percent_change = (total_Dengue - lag(total_Dengue))/lag(total_Dengue) * 100
)
ggplot(data_summary, aes(x = Year)) +
geom_line(aes(y = percent_change * 100), color = "red", size = 1.2) +
scale_y_continuous(
name = "Total Dengue Cases",
sec.axis = sec_axis(~./100, name = "% Change from Previous Year")
) +
geom_col(aes(y=total_Dengue), fill = "blue")+
theme_minimal(base_size = 12) +
labs(
title = "Total Dengue Cases and Year-over-Year % Change in Merida",
x = "Year"
)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
#maybe overlay the map of the location
# This plot shows the geospatial location of Dengue incidence in Merida, Mexico. We can potentially aesthetically improve this plot later using QGIS or ArcGIS.
plot(data$X, data$Y,
main = "Spatial Locations of Sampling Points",
xlab = "X Coordinate",
ylab = "Y Coordinate",
pch = 19, col = "blue")
library(ggplot2)
# 1️ Read file
library(sf)
town_map <- st_read("Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp")
## Reading layer `Merida_Den12_13_14_15' from data source
## `/Users/amanshaik/Documents/GitHub/dengueAnalysisTechWriting/Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 540 features and 6 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 212287.2 ymin: 2309495 xmax: 236386 ymax: 2333181
## Projected CRS: Mexico ITRF2008 / UTM zone 16N
st_crs(town_map)
## Coordinate Reference System:
## User input: Mexico ITRF2008 / UTM zone 16N
## wkt:
## PROJCRS["Mexico ITRF2008 / UTM zone 16N",
## BASEGEOGCRS["Mexico ITRF2008",
## DATUM["Mexico ITRF2008",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",6365]],
## CONVERSION["UTM zone 16N",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",-87,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",0.9996,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["Mexico east of 90°W, onshore and offshore."],
## BBOX[17.81,-90,25.77,-84.64]],
## ID["EPSG",6371]]
disease_data <- read.csv('Merida_Den12_13_14_15.csv')
# make sure CRS is same, convert to sf
disease_sf <- st_as_sf(disease_data, coords = c("X", "Y"), crs = st_crs(town_map))
names(disease_data)
## [1] "X" "Y" "SP_ID" "CVEGEO" "Den2012" "Den2013" "Den2014"
## [8] "Den2015"
# 4 generate plot
ggplot() +
geom_sf(data = town_map, fill = "gray95", color = "black", linewidth = 0.3) +
geom_sf(data = disease_sf, aes(color = Den2012, size = Den2012)) +
scale_color_viridis_c(option = "plasma") +
theme_minimal() +
labs(title = "Spatial Locations",
color = "Incidence",
size = "Incidence")
data_long <- data_cleaned %>%
pivot_longer(
cols = starts_with("Den"),
names_to = "Year",
names_prefix = "Den",
values_to = "Cases"
) %>%
mutate(Year = as.integer(Year))
global_max_cases <- max(data_long$Cases)
fig <- plot_ly(
data_long,
x = ~X,
y = ~Y,
frame = ~Year,
type = 'scatter',
mode = 'markers',
color = ~Cases,
colors = "Reds",
marker = list(
size = 10,
line = list(width = 0.5,color = 'black')
),
text = ~paste(
"Year:", Year,
"<br>Cases:", Cases,
"<br>X:", round(X, 2),
"<br>Y:", round(Y, 2)
),
hoverinfo = 'text'
) %>%
colorbar(title = "Case Count") %>%
layout(
title = 'Dengue Fever Density in Merida (2012–2015)',
xaxis = list(title = 'X Coordinate'),
yaxis = list(title = 'Y Coordinate'),
plot_bgcolor = "white",
paper_bgcolor = "white"
) %>%
animation_opts(
frame = 1500,
transition = 500,
easing = "linear"
) %>%
animation_slider(
currentvalue = list(
prefix = "Year: ",
font = list(color = "red")
)
)
fig
We can further analyze the evolution Dengue virus over a period of time with an animated density chart. By using plotly, we can visualize high-density infected regions in red, compared to low-density infected regions in white, and due to the animation, can see the spread of the disease throughout the 2012-2015 timespan. When analyzing the density chart in 2012, we can see a significant outbreak as there are two primary hotspot regions with case counts exceeding 100. It is important to note that these hotspots are not isolated. They are surrounded by Dengue-infested locations reporting nearly 50 cases, indicating a concentrated and widespread initial event. If we advance to 2013, we can see a surprising city-wide reduction in Dengue incidence. There are far fewer cases overall, with no single region in Merida reporting over 100 cases and only a handful approaching 50 cases. Overall, the area is dominated by low-incidence areas, suggesting the outbreak has died down. This trend continues into 2014, which shows an even more pronounced decline as the vast majority of regions report zero cases, and the most significant hotspot contains only around 30 cases. However, when analyzing 2015, we can see a powerful resurgence. While many areas remain low-incidence, two new, intense hotspots emerge with case counts approaching 150. Additionally, the space between these two hotspots contains a dense, concentrated cluster of locations reporting nearly 100 cases each. To sum up, this visualization demonstrates that disease spread in this urban environment is not random, but rather persistent and spatially clustered in specific, identifiable high-risk zones. The 2015 outbreak also demonstrates that new high-incidence areas are likely to appear adjacent to or within previous outbreak zones, confirming the geospatial nature of disease spread.
# 1. Convert data to sf and transform to WGS84 for Leaflet
# We use EPSG:6371 as identified in the original Section 7
data_sf <- st_as_sf(data, coords = c("X", "Y"), crs = 6371)
data_sf_wgs84 <- st_transform(data_sf, crs = 4326)
# 2. Load and transform the neighborhood shapefile
town_map_raw <- st_read("Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp")
## Reading layer `Merida_Den12_13_14_15' from data source
## `/Users/amanshaik/Documents/GitHub/dengueAnalysisTechWriting/Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 540 features and 6 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 212287.2 ymin: 2309495 xmax: 236386 ymax: 2333181
## Projected CRS: Mexico ITRF2008 / UTM zone 16N
town_map_wgs84 <- st_transform(town_map_raw, crs = 4326)
# 3. Create a sequential color palette for absolute cases
# We find the global max to keep the scale consistent across years
all_cases <- c(data_sf_wgs84$Den2012, data_sf_wgs84$Den2013, data_sf_wgs84$Den2014, data_sf_wgs84$Den2015)
global_max_cases <- max(all_cases, na.rm = TRUE)
pal_cases <- colorNumeric(
palette = "Reds",
domain = c(0, global_max_cases),
na.color = "#a0a0a0"
)
# 4. Create popups for each year
popup_2012 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2012 Cases: %d", SP_ID, Den2012), HTML)
popup_2013 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2013 Cases: %d", SP_ID, Den2013), HTML)
popup_2014 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2014 Cases: %d", SP_ID, Den2014), HTML)
popup_2015 <- ~lapply(sprintf("<strong>ID: %s</strong><br/>2015 Cases: %d", SP_ID, Den2015), HTML)
# 5. Create the interactive leaflet map
m_cases <- leaflet(data_sf_wgs84) %>%
# Add base map layers
addProviderTiles(providers$CartoDB.Positron, group = "Light Map") %>%
addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") %>%
# Add Neighborhoods layer
addPolygons(
data = town_map_wgs84,
group = "Neighborhoods",
fillOpacity = 0.1,
color = "#444",
weight = 1,
opacity = 0.7,
popup = ~SP_ID
) %>%
# Set the map to focus on the data
fitBounds(
~as.numeric(st_bbox(geometry)[1]),
~as.numeric(st_bbox(geometry)[2]),
~as.numeric(st_bbox(geometry)[3]),
~as.numeric(st_bbox(geometry)[4])
) %>%
# ==== Layer 1: 2012 Cases ====
addCircleMarkers(
group = "2012 Cases",
color = ~pal_cases(Den2012),
fillOpacity = 0.8,
stroke = FALSE,
radius = 5,
popup = popup_2012,
label = ~as.character(Den2012)
) %>%
# ==== Layer 2: 2013 Cases ====
addCircleMarkers(
group = "2013 Cases",
color = ~pal_cases(Den2013),
fillOpacity = 0.8,
stroke = FALSE,
radius = 5,
popup = popup_2013,
label = ~as.character(Den2013)
) %>%
# ==== Layer 3: 2014 Cases ====
addCircleMarkers(
group = "2014 Cases",
color = ~pal_cases(Den2014),
fillOpacity = 0.8,
stroke = FALSE,
radius = 5,
popup = popup_2014,
label = ~as.character(Den2014)
) %>%
# ==== Layer 4: 2015 Cases ====
addCircleMarkers(
group = "2015 Cases",
color = ~pal_cases(Den2015),
fillOpacity = 0.8,
stroke = FALSE,
radius = 5,
popup = popup_2015,
label = ~as.character(Den2015)
) %>%
# ==== Legend ====
addLegend(
pal = pal_cases,
values = c(0, global_max_cases),
title = "Case Count",
position = "bottomright",
opacity = 1
) %>%
# ==== Layer Control ====
addLayersControl(
baseGroups = c("2012 Cases", "2013 Cases", "2014 Cases", "2015 Cases"),
overlayGroups = c("Light Map", "Satellite", "Neighborhoods"),
options = layersControlOptions(collapsed = FALSE)
)
# Display the map
m_cases
# 1. Calculate Year-over-Year (YoY) percentage change
# We must handle cases where the denominator (previous year) is 0.
calculate_yoy <- function(current, previous) {
case_when(
previous == 0 & current == 0 ~ 0, # 0 to 0 is 0% change
previous == 0 & current > 0 ~ Inf, # 0 to N is infinite % change
previous > 0 ~ (current - previous) / previous
)
}
data_yoy <- data %>%
mutate(
YoY_2013 = calculate_yoy(Den2013, Den2012),
YoY_2014 = calculate_yoy(Den2014, Den2013),
YoY_2015 = calculate_yoy(Den2015, Den2014)
)
# 2. Clean up Inf/-Inf/NaN values for visualization. We'll set Inf to NA.
data_yoy <- data_yoy %>%
mutate(across(starts_with("YoY_"), ~if_else(is.finite(.), ., NA_real_)))
# 3. Convert to a geospatial 'sf' object and transform coordinates
# The .prj file (EPSG:6371) tells us this is Mexico ITRF2008 / UTM zone 16N.
# Leaflet requires standard WGS84 (EPSG:4326) coordinates (lat/lon).
data_sf <- st_as_sf(data_yoy, coords = c("X", "Y"), crs = 6371)
data_sf_wgs84 <- st_transform(data_sf, crs = 4326)
# 4. Define a diverging color palette
# We cap the domain from -1 (-100%) to 2 (200%) to handle outliers
cap_range <- c(-1, 2)
pal_yoy <- colorNumeric(
palette = "RdYlBu",
domain = cap_range,
na.color = "#a0a0a0",
reverse = TRUE # Make red positive (increase), blue negative (decrease)
)
# 5. Create formatted popup and label content
# Popups appear on click
popup_2013 <- ~lapply(sprintf(
"<strong>Location ID: %s</strong><br/>2012 Cases: %d<br/>2013 Cases: %d<br/>YoY Change: %s",
SP_ID, Den2012, Den2013, scales::percent(YoY_2013, accuracy = 0.1)
), HTML)
popup_2014 <- ~lapply(sprintf(
"<strong>Location ID: %s</strong><br/>2013 Cases: %d<br/>2014 Cases: %d<br/>YoY Change: %s",
SP_ID, Den2013, Den2014, scales::percent(YoY_2014, accuracy = 0.1)
), HTML)
popup_2015 <- ~lapply(sprintf(
"<strong>Location ID: %s</strong><br/>2014 Cases: %d<br/>2015 Cases: %d<br/>YoY Change: %s",
SP_ID, Den2014, Den2015, scales::percent(YoY_2015, accuracy = 0.1)
), HTML)
# Labels appear on hover
label_2013 <- ~scales::percent(YoY_2013, accuracy = 0.1)
label_2014 <- ~scales::percent(YoY_2014, accuracy = 0.1)
label_2015 <- ~scales::percent(YoY_2015, accuracy = 0.1)
# Load and transform the neighborhood shapefile
town_map_raw <- st_read("Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp")
## Reading layer `Merida_Den12_13_14_15' from data source
## `/Users/amanshaik/Documents/GitHub/dengueAnalysisTechWriting/Merida_S25/Merida_Den12_13_14_15/Merida_Den12_13_14_15.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 540 features and 6 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 212287.2 ymin: 2309495 xmax: 236386 ymax: 2333181
## Projected CRS: Mexico ITRF2008 / UTM zone 16N
town_map_wgs84 <- st_transform(town_map_raw, crs = 4326)
# Create the interactive leaflet map
m <- leaflet(data_sf_wgs84) %>%
# Add base map layers
addProviderTiles(providers$CartoDB.Positron, group = "Light Map") %>%
addProviderTiles(providers$Esri.WorldImagery, group = "Satellite") %>%
# Add Neighborhoods layer
addPolygons(
data = town_map_wgs84,
group = "Neighborhoods",
fillOpacity = 0.1,
color = "#444",
weight = 1,
opacity = 0.7,
popup = ~~SP_ID
) %>%
# Set the map to focus on the data
fitBounds(
~as.numeric(st_bbox(geometry)[1]),
~as.numeric(st_bbox(geometry)[2]),
~as.numeric(st_bbox(geometry)[3]),
~as.numeric(st_bbox(geometry)[4])
) %>%
# ==== Layer 1: 2013 vs 2012 ====
addCircleMarkers(
group = "2013 vs 2012",
color = ~pal_yoy(YoY_2013),
fillOpacity = 0.8,
stroke = FALSE,
radius = 5,
popup = popup_2013,
label = label_2013
) %>%
# ==== Layer 2: 2014 vs 2013 ====
addCircleMarkers(
group = "2014 vs 2013",
color = ~pal_yoy(YoY_2014),
fillOpacity = 0.8,
stroke = FALSE,
radius = 5,
popup = popup_2014,
label = label_2014
) %>%
# ==== Layer 3: 2015 vs 2014 ====
addCircleMarkers(
group = "2015 vs 2014",
color = ~pal_yoy(YoY_2015),
fillOpacity = 0.8,
stroke = FALSE,
radius = 5,
popup = popup_2015,
label = label_2015
) %>%
# ==== Legend ====
addLegend(
pal = pal_yoy,
values = cap_range,
title = "YoY % Change",
position = "bottomright",
labFormat = labelFormat(suffix = "%", transform = function(x) x * 100),
opacity = 1
) %>%
# ==== Layer Control ====
addLayersControl(
baseGroups = c("2013 vs 2012", "2014 vs 2013", "2015 vs 2014"),
overlayGroups = c("Light Map", "Satellite", "Neighborhoods"),
options = layersControlOptions(collapsed = FALSE)
)
## Warning in pal_yoy(YoY_2013): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2013): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2014): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2014): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2015): Some values were outside the color scale and will
## be treated as NA
## Warning in pal_yoy(YoY_2015): Some values were outside the color scale and will
## be treated as NA
# Display the map
m